GitHub Repository: debakarr/machinelearning
Path: blob/master/Part 3 - Classification/Random Forest/[R] Random Forest.ipynb
Kernel: R

Random Forest

Data preprocessing

# Importing the dataset
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]
head(dataset, 10)
# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.80)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
head(training_set, 10)
head(test_set, 10)
# Feature Scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
head(training_set, 10)
head(test_set, 10)
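As an optional check (not in the original notebook), sample.split stratifies on the label, so the class balance should be roughly the same in the training and test sets:

# Proportion of class 0 vs class 1 in each subset
prop.table(table(training_set$Purchased))
prop.table(table(test_set$Purchased))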

Fitting Random Forest classifier to the Training set

library(randomForest)
classifier = randomForest(x = training_set[-3],
                          y = training_set$Purchased,
                          ntree = 10)
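Not part of the original notebook, but printing and plotting the fitted randomForest object is a quick sanity check: it reports the number of trees, the number of variables tried at each split, and an out-of-bag (OOB) error estimate with its own confusion matrix.

print(classifier)   # ntree, mtry, OOB error estimate and OOB confusion matrix
plot(classifier)    # OOB error rate as more trees are added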

Predicting the Test set results

y_pred = predict(classifier, newdata = test_set[-3], type = 'class')
head(y_pred, 10)
head(test_set[3], 10)

Making the Confusion Matrix

cm = table(test_set[, 3], y_pred)
cm
   y_pred
      0  1
  0  45  6
  1   4 25

The classifier made 45 + 25 = 70 correct predictions and 6 + 4 = 10 incorrect predictions, i.e. an accuracy of 70/80 = 87.5% on the test set.
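The same figures can be read straight off the confusion matrix; a minimal sketch (not in the original notebook):

# Correct predictions lie on the diagonal of the confusion matrix.
accuracy = sum(diag(cm)) / sum(cm)
accuracy   # 70 / 80 = 0.875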


Visualising the Training set results

library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3],
     main = 'Random Forest (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'), col = 'white')
legend("topright", legend = c("0", "1"), pch = 16, col = c('red3', 'green4'))
[Plot: Random Forest (Training set) decision regions, Age vs. Estimated Salary]

Visualising the Test set results

library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3],
     main = 'Random Forest (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'), col = 'white')
legend("topright", legend = c("0", "1"), pch = 16, col = c('red3', 'green4'))
[Plot: Random Forest (Test set) decision regions, Age vs. Estimated Salary]
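As a rough illustration of why the features were scaled before plotting (see the last note below), the size of the visualisation grid can be checked directly; the numbers here are only indicative and depend on the data:

# Number of points the decision-boundary plot actually predicts on.
# After scaling, Age and EstimatedSalary each span only a few units, so a
# 0.01-step grid stays at roughly a few hundred thousand points.
length(X1) * length(X2)

# Without scaling, EstimatedSalary spans tens of thousands of units, so the
# same 0.01 step would produce billions of grid points, which is where the
# 200+ GB estimate in the notes below comes from.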

Things to remember when building a Random Forest classifier:

  • For each user, the 10 trees each predict whether the user buys the SUV or not, and the final class is decided by a majority vote. If most trees vote yes, the region is coloured green; if most vote no, it is coloured red (a small sketch of the per-tree votes follows this list).

  • Random Forest normally overfits the data. Looking carefully at the training-set plot above, it tries to catch every red dot that lies inside the green region. Likewise, in the test-set plot some green points in the top right end up inside a red region.

  • There is no need to scale the features, because a decision tree does not depend on Euclidean distance. Feature scaling is used here only to get a plot with better resolution: if you omit it, the visualisation grid for the case above would take roughly 200+ GB, which makes plotting impossible.
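To make the first note concrete, here is a small sketch (not in the original notebook) that pulls out the individual tree votes behind the majority decision. It assumes the classifier, test_set, and y_pred objects defined above, and uses the predict.all and type = 'prob' options of randomForest's predict method.

# Per-tree votes for the test set: predict.all = TRUE returns both the
# aggregated (majority) prediction and the vote of each of the 10 trees.
votes = predict(classifier, newdata = test_set[-3], type = 'class', predict.all = TRUE)

votes$individual[1, ]   # the 10 individual tree votes for the first test user
votes$aggregate[1]      # the majority decision, which matches y_pred[1]

# Vote fractions per class (e.g. 3 trees vs 7 trees out of 10)
head(predict(classifier, newdata = test_set[-3], type = 'prob'))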